arXiv:1508.05328v2 [cs.MA] 31 Mar 2016 




Multi-agent Reinforcement Learning with Sparse Interactions by 
Negotiation and Knowledge Transfer 

Luowei Zhou, Pei Yang, Chunlin Chen, Member, IEEE, Yang Gao, Member, IEEE 


Abstract —Reinforcement learning has significant applications 
for multi-agent systems, especially in unknown dynamic environments.
However, most multi-agent reinforcement learning (MARL) algorithms
suffer from problems such as exponential computational complexity in
the joint state-action space, which makes it difficult to scale up to
realistic multi-agent problems. In this paper, a novel algorithm named
negotiation-based MARL with sparse interactions (NegoSI) is presented.
In contrast to traditional sparse-interaction based MARL algorithms,
NegoSI adopts the equilibrium concept and makes it possible for agents
to select the non-strict Equilibrium Dominating Strategy Profile
(non-strict EDSP) or Meta equilibrium for their joint actions. The
presented NegoSI algorithm consists of four parts: the
equilibrium-based framework for sparse interactions, the negotiation
for the equilibrium set, the minimum variance method for selecting one
joint action and the knowledge transfer of local Q-values. In this
integrated algorithm, three techniques, i.e., unshared value functions,
equilibrium solutions and sparse interactions, are adopted to achieve
privacy protection, better coordination and lower computational
complexity, respectively. To evaluate the performance of the presented
NegoSI algorithm, two groups of experiments are carried out regarding
three criteria: steps of each episode (SEE), rewards of each episode
(REE) and average runtime (AR). The first group of experiments is
conducted using six grid world games and shows fast convergence and
high scalability of the presented algorithm. In the second group of
experiments, NegoSI is applied to an intelligent warehouse problem and
the simulated results demonstrate the effectiveness of the presented
NegoSI algorithm compared with other state-of-the-art MARL algorithms.

Index Terms —Knowledge transfer, multi-agent reinforcement 
learning, negotiation, sparse interactions. 

I. Introduction 

Multi-agent learning is drawing increasing interest from scientists
and engineers in the multi-agent systems (MAS) and machine learning
communities [1]-[4]. One key technique for multi-agent learning is
multi-agent reinforcement learning (MARL), which is an extension of
reinforcement learning to the multi-agent domain [?]. Several
mathematical models have been built as frameworks of MARL, such as
Markov games (MG) [6] and decentralized sparse-interaction Markov
decision processes (Dec-SIMDP) [?]. Markov games are based on the
assumption of full observability of all agents in the entire joint
state-action space. Several well-known equilibrium-based MARL
algorithms [?] are derived from this model. Dec-SIMDP based algorithms
rely on agents' local observation, i.e., the individual state and
action. Agents in Dec-SIMDP are modeled with a single-agent MDP when
they are outside of the interaction areas, while a multi-agent model
such as MG is used when they are inside. Typical Dec-SIMDP based
algorithms include LoC [13] and CQ-learning [14]. Besides, other models
such as learning automata [?] are also valuable tools for designing
MARL algorithms.

In spite of the rapid development of MARL theories and algorithms,
more effort is needed for practical applications of MARL when compared
with other MAS techniques [?], due to some limitations of the existing
MARL methods. Equilibrium-based MARL relies on a tightly coupled
learning process, which hinders its application in practice.
Calculating the equilibrium (e.g., Nash equilibrium [19]) for each time
step and all joint states is computationally expensive [?], even for
relatively small scale environments with two or three agents. In
addition, sharing individual states, individual actions or even value
functions all the time with other agents is unrealistic in some
distributed domains (e.g., streaming processing systems [20], sensor
networks [?]) given the agents' privacy protections and huge real-time
communication costs [?]. As for MARL with sparse interactions, agents
in this setting have no concept of equilibrium policy and they tend to
act aggressively towards their goals, which results in a high
probability of collisions.

Therefore, in this paper we focus on how the equilibrium mechanism can
be used in sparse-interaction based algorithms, and a negotiation-based
MARL algorithm with sparse interactions (NegoSI) is proposed for
multi-agent systems in unknown dynamic environments. The NegoSI
algorithm consists of four parts: the equilibrium-based framework for
sparse interactions, the negotiation for the equilibrium set, the
minimum variance method for selecting one joint action and the
knowledge transfer of local Q-values. Firstly, we start with the
proposed algorithm based on the MDP model and the assumption that the
agents have already obtained the single-agent optimal policy before
learning in multi-agent settings. Then, the agents negotiate for pure
strategy profiles as their potential set for the joint action at the
"coordination state" and they use the minimum variance method to select
a relatively good one. After that, the agents move to the next joint
state and receive immediate rewards. Finally, the agents' Q-values are
updated by their rewards and equilibrium utilities. When initializing
the Q-value for the expanded "coordination state", the agents with
NegoSI utilize both the environmental information and the prior-trained
coordinated Q-values. To test the effectiveness of the proposed
algorithm, several benchmarks are adopted to demonstrate its
performance in terms of convergence, scalability, fairness and
coordination ability. In addition, aiming at solving realistic MARL
problems, the presented NegoSI algorithm is further tested on an
intelligent warehouse simulation platform.

The rest of this paper is organized as follows. Section II introduces
the basic MARL theory and the MAS with sparse interactions. In Section
III, the negotiation-based MARL algorithm with sparse interactions
(NegoSI) is presented in detail and related issues are discussed.
Experimental results are shown in Section IV. Conclusions are given in
Section V.

II. BACKGROUND

In this section, some important concepts in multi-agent 
reinforcement learning and typical sparse-interaction based 
MARL algorithms are introduced. 

A. MDP and Markov Games 

We start by reviewing two standard decision-making models that are
relevant to our work, i.e., the Markov decision process (MDP) and
Markov games. The MDP is the foundation of Markov games, while Markov
games adopt game theory in multi-agent MDPs. An MDP describes a
sequential decision problem as follows [22]:

Definition 1: (Markov Decision Process, MDP) An MDP is a tuple
$\langle S, A, R, T \rangle$, where $S$ is the state space, $A$ is the
action space of the agent, $R: S \times A \to \mathbb{R}$ is the reward
function mapping state-action pairs to rewards, and
$T: S \times A \times S \to [0,1]$ is the transition function.

An agent in an MDP is required to find an optimal policy which
maximizes some reward-based optimization criterion, such as the
expected discounted sum of rewards:

$$V^*(s) = \max_{\pi} E_{\pi}\Big\{ \sum_{k=0}^{\infty} \gamma^k r^{t+k} \,\Big|\, s^t = s \Big\}, \tag{1}$$

where $V^*(s)$ stands for the value of a state $s$ under the optimal
policy, $\pi: S \times A \to [0,1]$ stands for the policy of an agent,
$E_{\pi}$ is the expectation under policy $\pi$, $t$ is any time step,
$k$ represents a future time step, $r^{t+k}$ denotes the reward at the
time step $(t+k)$ and $\gamma \in [0,1]$ is a parameter called the
discount factor. This goal can also be equivalently described using the
Q-value for a state-action pair:

$$Q^*(s,a) = r(s,a) + \gamma \sum_{s'} T(s,a,s') \max_{a'} Q^*(s',a'), \tag{2}$$

where $Q^*(s,a)$ stands for the value of a state-action pair $(s,a)$
under the optimal policy, $s'$ is the next state, $r(s,a)$ is the
immediate reward when the agent adopts the action $a$ at the state $s$,
and $T(s,a,s')$ is the probability that the agent transits from $s$ to
$s'$ given action $a$. One classic RL algorithm for estimating
$Q^*(s,a)$ is Q-learning [23], whose one-step updating rule is as
follows:


$$Q(s,a) \leftarrow (1-\alpha) Q(s,a) + \alpha \big[ r(s,a) + \gamma \max_{a'} Q(s',a') \big], \tag{3}$$

where $Q(s,a)$ denotes the state-action value function at a
state-action pair $(s,a)$ and $\alpha \in [0,1]$ is a parameter called
the learning rate. Provided that all state-action pairs are visited
infinitely often with a reasonable learning rate, the estimated Q-value
$Q(s,a)$ converges to $Q^*(s,a)$ [24].
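
As an illustration of the update rule in Eq. (3), the following is a minimal tabular sketch in Python. The dictionary-based table and the names `make_q_table` and `q_learning_update` are assumptions made for this example and do not come from the paper; the default values of the learning rate and discount factor follow the experimental settings reported in Section IV.

```python
ACTIONS = ("up", "down", "left", "right")


def make_q_table(states, actions=ACTIONS):
    """Tabular Q-function initialised to zero: Q[state][action]."""
    return {s: {a: 0.0 for a in actions} for s in states}


def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One-step Q-learning update of Eq. (3), applied in place."""
    # Bootstrapped target: immediate reward plus discounted greedy value.
    target = r + gamma * max(Q[s_next].values())
    Q[s][a] = (1.0 - alpha) * Q[s][a] + alpha * target
    return Q[s][a]


# Hypothetical three-state chain used only to exercise the update.
Q = make_q_table(states=range(3))
q_learning_update(Q, s=0, a="right", r=-1.0, s_next=1)
```

Repeated calls on the same transition move the estimate a fraction $\alpha$ of the way towards the target each time, which is the behaviour the convergence condition above relies on.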

Markov games are widely adopted as a framework for multi-agent
reinforcement learning (MARL) [6], [10]. A Markov game can be regarded
as multiple MDPs in which the transition probabilities and rewards
depend on the joint state-action pairs of all agents. In a certain
state, the agents' individual action sets generate a repeated game that
can be solved in a game-theoretic way. Therefore, the Markov game is a
richer framework which generalizes both the MDP and the repeated game
[?].

Definition 2: (Markov game) An $n$-agent ($n \geq 2$) Markov game is a
tuple $\langle n, \{S_i\}_{i=1,\ldots,n}, \{A_i\}_{i=1,\ldots,n},
\{R_i\}_{i=1,\ldots,n}, T \rangle$, where $n$ is the number of agents
in the system, $S_i$ is the state space of the $i$-th agent,
$S = \{S_i\}_{i=1,\ldots,n}$ is the set of state spaces for all agents,
$A_i$ is the action space of the $i$-th agent,
$A = \{A_i\}_{i=1,\ldots,n}$ is the set of action spaces for all
agents, $R_i: S \times A \to \mathbb{R}$ is the reward function of
agent $i$, and $T: S \times A \times S \to [0,1]$ is the transition
function.

Denote the individual policy of agent $i$ by
$\pi_i: S \times A_i \to [0,1]$ and the joint policy of all agents by
$\pi = (\pi_1, \ldots, \pi_n)$. The Q-value of the joint state-action
pair for agent $i$ under the joint policy $\pi$ can be formulated by

$$Q_i^{\pi}(\vec{s},\vec{a}) = E_{\pi}\Big\{ \sum_{k=0}^{\infty} \gamma^k r_i^{t+k} \,\Big|\, \vec{s}^{\,t} = \vec{s}, \vec{a}^{\,t} = \vec{a} \Big\}, \tag{4}$$

where $\vec{s} \in S$ stands for a joint state, $\vec{a} \in A$ for a
joint action and $r_i^{t+k}$ is the reward received at the time step
$(t+k)$. Unlike the optimization goal in an MDP, the objective of a
Markov game is to find an equilibrium joint policy $\pi$ rather than an
optimal joint policy for all agents. Here, the equilibrium policy
concept is usually reduced to finding the equilibrium solution for the
one-shot game played in each joint state of a Markov game [?]. Several
equilibrium-based MARL algorithms in the existing literature, such as
NashQ [?] and NegoQ [?], have been proposed, so that the joint
state-action pair Q-value can be updated according to the equilibrium:

$$Q_i(\vec{s},\vec{a}) \leftarrow (1-\alpha) Q_i(\vec{s},\vec{a}) + \alpha \big( r_i(\vec{s},\vec{a}) + \gamma \phi_i(\vec{s}\,') \big), \tag{5}$$

where $\phi_i(\vec{s}\,')$ denotes the expected value of the
equilibrium in the next joint state $\vec{s}\,'$ for agent $i$ and can
be calculated in the one-shot game at that joint state.

B. MAS with Sparse Interactions 

The definition of a Markov game reveals that all agents need to learn
their policies in the full joint state-action space and that they are
coupled with each other all the time [10], [29]. However, this
assumption does not hold in practice. In many practical multi-agent
systems the learning agents are loosely coupled, with limited
interactions in some particular areas [?]. Meanwhile, the interactions
between agents may not always involve all the agents. These facts lead
to a new mechanism of sparse interactions for MARL research.

Fig. 1: A part of an intelligent warehouse with three robots, where
Ri (i = 1,2,3) represents robot i with its corresponding goal Gi and
the shaded grids are storage shelves.

Sparse-interaction based algorithms [?] have recently found wide
applications in MAS research. An example of MAS with sparse
interactions is the intelligent warehouse system, where autonomous
robots only consider other robots when they are close enough to each
other [25] (see Fig. 1), i.e., when they meet around a crossroad.
Otherwise, they can act independently.

Earlier works such as coordinated reinforcement learning [?] and sparse
cooperative Q-learning [?] used coordination graphs (CGs) to learn
interdependencies between agents and decomposed the joint value
function into local value functions. However, these algorithms cannot
learn CGs online and only focus on finding specific states where
coordination is necessary instead of learning for coordination [13].
Melo and Veloso [13] extended Q-learning to a two-layer algorithm and
made it possible for agents to use an additional COORDINATE action to
determine the states in which coordination is necessary. De Hauwere and
Vrancx proposed Coordinating Q-learning (CQ-learning) [14] and
FCQ-learning [?], which help agents learn from statistical information
about rewards and Q-values where an agent should take other agents'
states into account. However, all these algorithms allow agents to play
greedy strategies at a certain joint state rather than equilibrium
strategies, which might cause conflicts.

More recently, Hu et al. [?] proposed an efficient equilibrium-based
MARL method, called Negotiation-based Q-learning, by which agents can
learn in a Markov game with unshared value functions and unshared joint
state-actions. In later work, they applied this method to sparse
interactions by knowledge transfer and game abstraction [29], and
demonstrated the effectiveness of equilibrium-based MARL in solving
sparse-interaction problems. Nevertheless, as opposed to single
Q-learning based approaches like CQ-learning, Hu's equilibrium-based
methods for sparse interactions require a great deal of real-time
information about the joint states and joint actions of all the agents,
which results in a huge amount of communication. In this paper, we
focus on solving learning problems in complex systems. The tightly
coupled equilibrium-based MARL methods discussed above are impractical
in these situations, while sparse-interaction based algorithms tend to
cause many collisions. To this end, we adopt the sparse-interaction
based learning framework, and each agent selects equilibrium joint
actions when it is in a coordination area.

III. NEGOTIATION-BASED MARL WITH SPARSE 
INTERACTIONS 

When people work in restricted environments with possible conflicts,
they usually learn how to finish their individual tasks first and then
learn how to coordinate with others. We apply this common sense to our
sparse-interaction method and decompose the learning process into two
distinct sub-processes [?]. First, each agent learns an optimal
single-agent policy by itself in the static environment and ignores the
existence of other agents. Second, each agent learns when to coordinate
with others according to changes in its immediate rewards, and then
learns how to coordinate with others in a game-theoretic way. In this
section, the negotiation-based framework for MAS with sparse
interactions is first introduced, and then related techniques and
specific algorithms are described in detail.

A. Negotiation-based Framework for Sparse Interactions 

We assume that agents have learnt their optimal policies and 
reward models when acting individually in the environment. 
Two situations might occur when agents are working in a 
multi-agent setting. If the received immediate rewards for 
state-action pairs are the same as what they learned by reward 
models, the agents act independently. Otherwise, they need to 
expand their individual state-action pairs to the joint ones by 
adding other agents’ state-action pairs for better coordination. 
This negotiation-based framework for sparse interactions is given in
Algorithm 1, while the expansion and negotiation process is explained
as follows:

1) Broadcast joint state. Agents select an action at a certain
state, and they detect a change in the immediate rewards.
This state-action pair is marked as "dangerous" and
these agents are called "coordinating agents". Then, the
"coordinating agents" broadcast their state-action
information to all other agents and receive the corresponding
state-action information from others. These state-action
pairs with reward changes form a joint state-action pair, which
is marked as a "coordination pair". Also, these states
form a joint state called a "coordination state", which is
included in the state space of each "coordinating agent".

2) Negotiation for equilibrium policy. When agents select a
"dangerous" state-action pair, they broadcast their
state-action information to each other to determine whether
they are staying at a "coordination pair". If so, the
agents need to find an equilibrium policy rather than
their inherent greedy policies to avoid collisions. We
propose a negotiation mechanism similar to the work in
[?] to find this equilibrium policy. Each "coordinating
agent" broadcasts the set of strategy profiles that are
potential Non-strict Equilibrium-Dominating Strategy
Profiles (non-strict EDSPs) according to its own utilities
(see Algorithm 2). If no non-strict EDSP is found, the
agents search for a Meta equilibrium set (see Algorithm
3) instead, which is always nonempty [?]. Then, if there
are several candidates in the equilibrium set, the agents
use the minimum variance method to find a relatively
good one (see Algorithm 4).

3) When an agent selects an action at a certain state, if
neither a change in the immediate reward nor a
"dangerous" state-action pair is detected, the agent continues
to select its action independently.

Remark 1: If the number of agents n > 2, the agents may have several
different expansions for different "coordination states" in one state.
Each expanded joint state is assigned an index and has a corresponding
local Q-value. Once the agents form a "coordination state", they search
for the joint state in their expanded joint-state pool and obtain the
corresponding local Q-value. An example is shown in Fig. 2. In this
example, agents 1-4 have twelve, four, five and three states,
respectively. For convenience, we assume here that when agent 1 is
expanding joint states, it adds all states of the relevant agents to
its joint states, even though in most cases it only needs to add a
small part of them. Specifically, the fourth state of agent 1 is
expanded to two joint states: the first one is with agent 4 and the
second one is with agent 2 and agent 4. Similarly, the fifth state of
agent 1 is expanded to three joint states: the first one is with agent
2, the second one is with agent 2 and agent 3, and the third one is
with agent 3.

Remark 2: Owing to the broadcast mechanism, an agent will add the
states of all the "coordinating agents" to its state space, even though
it does not need to coordinate with some of these agents. The agents
cannot distinguish the agents they need to coordinate with from the
other agents, which brings unnecessary calculations. This problem
becomes more significant as the complexity of the environment and the
number of agents grow. Our tentative solution is to set a communication
range for the agents so that they only add neighbouring agents' states
to their state space and ignore those agents that they do not need to
coordinate with.

Remark 3: It is worth noting that whenever an agent detects a
"dangerous" state-action pair, it should also observe a reward change.
In other words, if we replaced the former concept with the latter one,
the algorithm would work equally well. Defining dangerous state-action
pairs mainly helps us to explain our ideas and describe the algorithm.
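
The bookkeeping implied by the expansion step and Remarks 1-3 can be sketched as follows. This is an illustrative sketch only: the class `CoordinationMemory`, its method names, the tolerance `tol` and the placeholder `init_value` are assumptions introduced here (the paper initializes new local Q-values by knowledge transfer rather than with a constant). It assumes a reward model learned in the single-agent phase is available as a dictionary.

```python
class CoordinationMemory:
    """Hypothetical per-agent bookkeeping for the expansion step."""

    def __init__(self, reward_model, tol=1e-9):
        self.reward_model = reward_model  # {(s_i, a_i): reward learned alone}
        self.tol = tol                    # assumed tolerance for a "change"
        self.dangerous = set()            # individual pairs with reward changes
        self.local_q = {}                 # {joint_state: {joint_action: value}}

    def observe(self, s_i, a_i, observed_reward):
        """Mark (s_i, a_i) as dangerous if the reward deviates from the model.

        Returns True when the pair is (now or already) dangerous.
        """
        expected = self.reward_model[(s_i, a_i)]
        if abs(observed_reward - expected) > self.tol:
            self.dangerous.add((s_i, a_i))
        return (s_i, a_i) in self.dangerous

    def expand(self, joint_state, joint_actions, init_value=0.0):
        """Create the local Q-table of a new coordination state if absent."""
        if joint_state not in self.local_q:
            self.local_q[joint_state] = {ja: init_value for ja in joint_actions}
        return self.local_q[joint_state]


# Hypothetical usage: a step that cost -1 alone now yields -10 (a collision).
memory = CoordinationMemory({((1, 1), "up"): -1.0})
first = memory.observe((1, 1), "up", -1.0)
second = memory.observe((1, 1), "up", -10.0)
table = memory.expand(((1, 1), (1, 2)), [("up", "down"), ("up", "left")])
```

Keying the pool by joint state mirrors the indexed "expanded joint-state pool" of Remark 1, so one individual state can own several expansions involving different sets of agents.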


Algorithm 1 Negotiation-based Framework for Sparse Interactions

Input: The agent $i$, state space $S_i$, action space $A_i$, learning rate $\alpha$, discount rate $\gamma$, exploration factor $\epsilon$ for the $\epsilon$-greedy exploration policy.
Initialize: Global Q-value $Q_i$ with optimal single-agent policy.

 1: for each episode do
 2:   Initialize state $s_i$;
 3:   while true do
 4:     Select $a_i \in A_i$ from $Q_i$ with $\epsilon$-greedy;
 5:     if $(s_i, a_i)$ is "dangerous" then
 6:       Broadcast $(s_i, a_i)$ and receive $(s_{-i}, a_{-i})$, form $(\vec{s}, \vec{a})$;
          /* See Section III-D for the definitions of $s_{-i}$ and $a_{-i}$ */
 7:       if $(\vec{s}, \vec{a})$ is not a "coordination pair" then
 8:         Mark $(\vec{s}, \vec{a})$ as a "coordination pair" and $\vec{s}$ as a "coordination state", initialize local Q-value $Q_i^J$ at $\vec{s}$ with transfer knowledge (see Equation (9));
 9:       end if
10:       Negotiate for the equilibrium joint action with Algorithm 2 - Algorithm 4. Select new $\vec{a}$ with $\epsilon$-greedy;
11:     else {detected an immediate reward change}
12:       Mark $(s_i, a_i)$ as "dangerous", broadcast $(s_i, a_i)$ and receive $(s_{-i}, a_{-i})$, form $(\vec{s}, \vec{a})$;
13:       Mark $(\vec{s}, \vec{a})$ as a "coordination pair" and $\vec{s}$ as a "coordination state", initialize $Q_i^J$ at $\vec{s}$ with transfer knowledge (see Equation (9));
14:       Negotiate for the equilibrium joint action with Algorithm 2 - Algorithm 4. Select new $\vec{a}$ with $\epsilon$-greedy;
15:     end if
16:     Move to the next state $s_i'$ and receive the reward $r_i$;
17:     if $\vec{s}$ exists then
18:       Update $Q_i^J(\vec{s}, \vec{a}) = (1-\alpha) Q_i^J(\vec{s}, \vec{a}) + \alpha (r_i + \gamma \max_{a_i'} Q_i(s_i', a_i'))$;
19:     end if
20:     Update $Q_i(s_i, a_i) = (1-\alpha) Q_i(s_i, a_i) + \alpha (r_i + \gamma \max_{a_i'} Q_i(s_i', a_i'))$;
21:     $s_i \leftarrow s_i'$;
22:   end while until $s_i$ is a terminal state.
23: end for

Fig. 2: An example of the expansion of an agent's state space.


B. Negotiation for Equilibrium Set 

One contribution of this paper is to apply the equilibrium solution
concept to traditional Q-learning based MARL algorithms. Different from
previous work like CQ-learning and Learning of Coordination [13], our
approach aims at finding the equilibrium solution for the one-shot game
played in each "coordination state". As a result, we focus on two kinds
of pure strategy profiles [?], i.e., the Non-strict
Equilibrium-Dominating Strategy Profile (non-strict EDSP) and the Meta
Equilibrium set. The definition of the non-strict EDSP is as follows.

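Definition 3: (Non-strict EDSP) In an $n$-agent ($n \geq 2$)
normal-form game $\Gamma$, let $\vec{e}_i \in A$ ($i = 1, 2, \ldots, m$)
be the pure strategy Nash equilibria. A joint action $\vec{a} \in A$ is
a non-strict EDSP if $\forall j \leq n$,

$$U_j(\vec{a}) \geq \min_{i} U_j(\vec{e}_i). \tag{6}$$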

The negotiation process of finding the set of non-strict EDSPs is shown
in Algorithm 2, which generally consists of two steps: 1) the agents
find the sets of strategy profiles that are potentially non-strict
EDSPs according to their individual utilities; 2) the agents compute
the intersection of all these potential strategy sets and obtain the
non-strict EDSP set.
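
The two steps above can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: each agent's utilities are assumed to be stored in an array with one axis per agent, the function names `edsp_candidates` and `negotiate_edsp` are hypothetical, and the broadcast is replaced by a plain set intersection in a single process.

```python
import numpy as np


def edsp_candidates(U_i, i):
    """Step 1 for agent i, using only its own utility array U_i.

    Axis k of U_i indexes the action of agent k. The threshold is the
    smallest utility agent i obtains from a best response, i.e. the
    minimum utility over its pure-strategy Nash equilibrium candidates.
    """
    best_response_values = U_i.max(axis=i)   # max over a_i for each a_{-i}
    threshold = best_response_values.min()   # MinU_PNE of agent i
    return {idx for idx in np.ndindex(U_i.shape) if U_i[idx] >= threshold}


def negotiate_edsp(utilities):
    """Step 2: intersect the candidate sets of all agents."""
    candidate_sets = [edsp_candidates(U, i) for i, U in enumerate(utilities)]
    return set.intersection(*candidate_sets)


# Hypothetical 2x2 coordination game with identical utilities.
U1 = np.array([[2, 0], [0, 1]])
U2 = np.array([[2, 0], [0, 1]])
edsp_set = negotiate_edsp([U1, U2])
```

In this toy game both agents accept exactly the two diagonal joint actions, which are also its pure strategy Nash equilibria. Note that each candidate set is computed from a single agent's utilities, in line with the unshared value functions assumed in the paper.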

However, given the fact that a pure strategy Nash equilibrium may not
exist in some cases, the set of non-strict EDSPs might also be empty.
In this case, we replace this strategy profile with the Meta
Equilibrium, whose set is always nonempty. The sufficient and necessary
condition of the Meta Equilibrium is defined as follows.

Definition 4: (Sufficient and Necessary Condition of Meta Equilibrium)
[?] In an $n$-agent ($n \geq 2$) normal-form game $\Gamma$, a joint
action $\vec{a}$ is called a Meta equilibrium from a metagame
$k_1 k_2 \ldots k_r \Gamma$ if and only if for any $i$ there holds

$$U_i(\vec{a}) \geq \min_{\vec{a}_{P_i}} \max_{a_i} \min_{\vec{a}_{S_i}} U_i(\vec{a}_{P_i}, a_i, \vec{a}_{S_i}), \tag{7}$$

where $P_i$ is the set of agents listed before sign $i$ in the prefix
$k_1 k_2 \ldots k_r$ and $S_i$ is the set of agents listed after sign
$i$ in the prefix.

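For example, in a three-agent metagame $213\Gamma$, we have
$P_1 = \{2\}, S_1 = \{3\}$; $P_2 = \emptyset, S_2 = \{1,3\}$;
$P_3 = \{2,1\}, S_3 = \emptyset$. A Meta equilibrium $\vec{a}$ meets
the following constraints:

$$\begin{aligned}
U_1(\vec{a}) &\geq \min_{a_2} \max_{a_1} \min_{a_3} U_1(a_1, a_2, a_3), \\
U_2(\vec{a}) &\geq \max_{a_2} \min_{a_1} \min_{a_3} U_2(a_1, a_2, a_3), \\
U_3(\vec{a}) &\geq \min_{a_2} \min_{a_1} \max_{a_3} U_3(a_1, a_2, a_3).
\end{aligned} \tag{8}$$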

Hu et al. [?] used a negotiation-based method to find the Meta
Equilibrium set, and we simplify the method as shown in Algorithm 3. It
is also pointed out there that both of these strategy profiles belong
to the set of symmetric meta equilibria, which to some degree
strengthens the convergence of the algorithm.
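
To make Eq. (7) concrete, the sketch below computes each agent's threshold for a given prefix and intersects the resulting acceptance sets. It is an illustrative sketch under assumptions: agents are numbered from 0 (so the prefix 213 of the example becomes `(1, 0, 2)`), the prefix is assumed to list every agent exactly once, utilities are NumPy arrays with one axis per agent, and the function names are hypothetical.

```python
import numpy as np


def meta_threshold(U_i, i, prefix):
    """Right-hand side of Eq. (7) for agent i (0-based) under `prefix`."""
    pos = prefix.index(i)
    before = tuple(prefix[:pos])       # P_i: agents listed before i
    after = tuple(prefix[pos + 1:])    # S_i: agents listed after i
    v = U_i
    if after:                          # innermost: min over actions of S_i
        v = v.min(axis=after, keepdims=True)
    v = v.max(axis=i, keepdims=True)   # then max over agent i's own action
    if before:                         # outermost: min over actions of P_i
        v = v.min(axis=before, keepdims=True)
    return float(v.squeeze())


def meta_equilibria(utilities, prefix):
    """Joint actions whose utility meets every agent's threshold."""
    accepted = []
    for i, U_i in enumerate(utilities):
        threshold = meta_threshold(U_i, i, prefix)
        accepted.append(
            {idx for idx in np.ndindex(U_i.shape) if U_i[idx] >= threshold}
        )
    return set.intersection(*accepted)


# Matching pennies has no pure strategy Nash equilibrium, yet its
# Meta equilibrium set under the prefix (0, 1) is nonempty.
U1 = np.array([[1, -1], [-1, 1]])
U2 = -U1
meta_set = meta_equilibria([U1, U2], prefix=(0, 1))
```

The `keepdims=True` reductions keep every agent on its original axis, so the same code handles any prefix ordering without re-indexing.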


Algorithm 2 Negotiation for Non-strict EDSPs Set

Input: A normal-form game $\langle n, \{A_i\}_{i=1,\ldots,n}, \{U_i\}_{i=1,\ldots,n} \rangle$.
  /* Note that "coordinating agent" $i$ only has the knowledge of $n$, $\{A_i\}_{i=1,\ldots,n}$ and $U_i$ */
Initialize: The set of non-strict EDSP candidates for "coordinating agent" $i$: $J_{NS}^i \leftarrow \emptyset$;
  Minimum utility value of pure strategy Nash equilibrium (PNE) candidates for "coordinating agent" $i$: $MinU_{PNE}^i \leftarrow \infty$;
  The set of non-strict EDSPs: $J_{NS} \leftarrow \emptyset$.

 1: for each $\vec{a}_{-i} \in A_{-i}$ do
 2:   if $\max_{a_i'} U_i(a_i', \vec{a}_{-i}) < MinU_{PNE}^i$ then
 3:     $MinU_{PNE}^i = \max_{a_i'} U_i(a_i', \vec{a}_{-i})$;
 4:   end if
 5: end for
 6: for each $\vec{a} \in A$ do
 7:   if $U_i(\vec{a}) \geq MinU_{PNE}^i$ then
 8:     $J_{NS}^i \leftarrow J_{NS}^i \cup \{\vec{a}\}$;
 9:   end if
10: end for
    /* Broadcast $J_{NS}^i$ and corresponding utilities */
11: $J_{NS} \leftarrow \bigcap_{i=1}^{n} J_{NS}^i$
C. Minimum Variance Method

In Section III-B, we presented the process by which all "coordinating
agents" negotiate for an equilibrium joint-action set. However, the
obtained set usually contains many strategy profiles and it is
difficult for the agents to choose an appropriate one. In this paper, a
Minimum Variance method is proposed to help the "coordinating agents"
choose the joint action with relatively high total utility and minimum
variance of the utilities, in order to promote the cooperation and
fairness of the learning process. In addition, if the equilibrium set
is nonempty, the best solution defined in the minimum variance method
always exists. The minimum variance method is described in Algorithm 4.

After negotiating for one equilibrium, the agents update their states
according to the joint action and receive immediate rewards, which are
used to update local Q-values as well as global Q-values. Unlike other
sparse-interaction algorithms (e.g., CQ-learning), we update the global
optimal Q-values to avoid possible misleading information. In fact, in
some cases, the policies selected in the multi-agent setting are
totally opposite to the agents' individual optimal policies due to the
dynamic coordinating processes. The whole negotiation-based learning
algorithm has already been given in Algorithm 1.
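
The selection rule of the minimum variance method can be sketched in a few lines. This is an illustrative sketch, assuming the equilibrium set is given as a dictionary from joint actions to per-agent utility tuples; the function name `minimum_variance_choice` and the example numbers are made up for illustration. Following the description of Algorithm 4, the threshold is taken as the mean total utility over all candidate joint actions.

```python
import statistics


def minimum_variance_choice(candidates):
    """Pick one joint action from a nonempty equilibrium set.

    candidates: dict mapping a joint action to the tuple of utilities
    (one per agent) it yields.
    """
    totals = {a: sum(u) for a, u in candidates.items()}
    # Threshold: mean of the total utilities of all candidates.
    tau = sum(totals.values()) / len(totals)
    # Discard joint actions whose total utility falls below the threshold.
    kept = [a for a in candidates if totals[a] >= tau]
    # Among the rest, take the smallest variance of utilities across agents;
    # ties are resolved in favour of the first candidate encountered.
    return min(kept, key=lambda a: statistics.pvariance(candidates[a]))


# Hypothetical equilibrium set for two agents.
equilibria = {
    ("up", "down"): (10, 0),     # high total, very unequal
    ("right", "left"): (4, 4),   # slightly lower total, perfectly fair
    ("left", "left"): (1, 1),    # fair but low total, filtered out
}
chosen = minimum_variance_choice(equilibria)
```

Because at least one candidate always has a total utility no smaller than the mean, the filtered list is never empty, which matches the claim that a best solution exists whenever the equilibrium set is nonempty.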

Algorithm 3 Negotiation for Meta Equilibrium Set for 3-agent games

Input: A normal-form game $\langle n, \{A_i\}_{i=1,\ldots,n}, \{U_i\}_{i=1,\ldots,n} \rangle$.
  /* Note that "coordinating agent" $i$ only has the knowledge of $n$, $\{A_i\}_{i=1,\ldots,n}$ and $U_i$ */
Initialize: The set of Meta Equilibrium candidates for "coordinating agent" $i$: $J_{MetaE}^i \leftarrow \emptyset$;
  Minimum utility value of Meta Equilibrium candidates for "coordinating agent" $i$: $MinU_{MetaE}^i \leftarrow \infty$;
  The set of Meta Equilibria: $J_{MetaE} \leftarrow \emptyset$;

 1: Randomly initialize the prefix $s_1 s_2 s_3$ from the set {123, 132, 213, 231, 312, 321}.
 2: Calculate $MinU_{MetaE}^i$ according to Equation (7).
 3: for each $\vec{a} \in A$ do
 4:   if $U_i(\vec{a}) \geq MinU_{MetaE}^i$ then
 5:     $J_{MetaE}^i \leftarrow J_{MetaE}^i \cup \{\vec{a}\}$;
 6:   end if
 7: end for
    /* Broadcast $J_{MetaE}^i$ and corresponding utilities */
 8: $J_{MetaE} \leftarrow \bigcap_{i=1}^{n} J_{MetaE}^i$


Algorithm 4 Minimum Variance Method

Input: The equilibrium set with $m$ elements $J_{NS} = \{\vec{a}_{ns}^1, \vec{a}_{ns}^2, \ldots, \vec{a}_{ns}^m\}$ and the corresponding utilities $\{U_i(\vec{a}_{ns}^j)\}$.
Initialize: Threshold value for total utility $\tau$;
  /* We set the threshold value to the mean value of the sum utilities of different joint-action profiles, $\tau = \frac{1}{m} \sum_{j=1}^{m} \sum_{i=1}^{n} U_i(\vec{a}_{ns}^j)$ */
  Minimum variance of these equilibria $MinV \leftarrow \infty$;
  Best equilibrium of the non-strict EDSPs set $J_{BestNS} \leftarrow \emptyset$.

 1: for each $\vec{a}_{ns}^j \in J_{NS}$ do
 2:   if $\sum_{i=1}^{n} U_i(\vec{a}_{ns}^j) < \tau$ then
 3:     $J_{NS} \leftarrow J_{NS} \setminus \{\vec{a}_{ns}^j\}$;
 4:   end if
 5: end for
    /* Minimize the joint action's utility variance */
 6: for each $\vec{a}_{ns}^j \in J_{NS}$ do
 7:   if $\frac{1}{n} \sum_{k=1}^{n} [U_k(\vec{a}_{ns}^j) - \frac{1}{n} \sum_{i=1}^{n} U_i(\vec{a}_{ns}^j)]^2 < MinV$ then
 8:     $MinV = \frac{1}{n} \sum_{k=1}^{n} [U_k(\vec{a}_{ns}^j) - \frac{1}{n} \sum_{i=1}^{n} U_i(\vec{a}_{ns}^j)]^2$;
 9:     $J_{BestNS} = \{\vec{a}_{ns}^j\}$;
10:   end if
11: end for
12: Output $J_{BestNS}$ as the adopted joint action.


D. Local Q-value transfer

At the beginning of the learning process, we use the transferred
knowledge of the agents' optimal single-agent policies to accelerate
learning. Furthermore, it is possible to improve the algorithm
performance by local Q-value transfer. In most previous literature [?],
the initial local Q-values of the newly expanded joint states are
zeros. Recently, Vrancx et al. proposed a transfer learning method [34]
to initialize these Q-values with prior trained Q-values from a source
task, which is reasonable in the real world. When people meet others on
the way to their individual destinations, they usually have prior
knowledge of how to avoid collisions. Based on this prior common sense
and the knowledge of how to finish their individual tasks, the agents
learn to negotiate with others and obtain fixed coordination strategies
suitable for certain environments. However, Vrancx et al. only used
coordination knowledge to initialize the local Q-values and overlooked
the environmental information, which proved to be less effective in our
experiments. In our approach, we initialize the local Q-values of newly
detected "coordination states" with hybrid knowledge as follows:


$$Q_i^J((s_i, \vec{s}_{-i}), (a_i, \vec{a}_{-i})) \leftarrow Q_i(s_i, a_i) + Q_i^{CT}(\vec{s}, \vec{a}), \tag{9}$$

where $\vec{s}_{-i}$ is the joint-state set except for the state $s_i$
and $\vec{a}_{-i}$ is the joint-action set except for the action $a_i$,
$Q_i^J((s_i, \vec{s}_{-i}), (a_i, \vec{a}_{-i}))$ is equal to
$Q_i^J(\vec{s}, \vec{a})$, and $Q_i^{CT}(\vec{s}, \vec{a})$ is the
transferred Q-value from a blank source task at joint state $\vec{s}$.
In this blank source task, joint action learners (JAL) [35] learn to
coordinate with others disregarding the environmental information. They
are given a fixed number of learning steps to learn a stable Q-value
$Q_i^{CT}(\vec{s}, \vec{a})$. Similar to [34], the joint state
$\vec{s}$ in $Q_i^{CT}(\vec{s}, \vec{a})$ is represented by the
relative position $(\Delta x, \Delta y)$, horizontally and vertically.
When agents attempt to move into the same grid location or into each
other's previous locations (as shown in Fig. 3), these agents receive a
penalty of -10. In other cases, the reward is zero.

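Take a two-agent source task for example (as shown in Fig. 4(a)). The
locations of agent 1 and agent 2 are (4,3) and (3,4), respectively.
Then we have the joint state
$\vec{s} = (x_1 - x_2, y_1 - y_2) = (1, -1)$ and the joint action for
this joint state

$$\vec{a} = (a_1, a_2) \in \begin{pmatrix}
(\uparrow,\uparrow) & (\uparrow,\downarrow) & (\uparrow,\leftarrow) & (\uparrow,\rightarrow) \\
(\downarrow,\uparrow) & (\downarrow,\downarrow) & (\downarrow,\leftarrow) & (\downarrow,\rightarrow) \\
(\leftarrow,\uparrow) & (\leftarrow,\downarrow) & (\leftarrow,\leftarrow) & (\leftarrow,\rightarrow) \\
(\rightarrow,\uparrow) & (\rightarrow,\downarrow) & (\rightarrow,\leftarrow) & (\rightarrow,\rightarrow)
\end{pmatrix},$$

where $\{\uparrow, \downarrow, \leftarrow, \rightarrow\}$ denotes the
action set of {up, down, left, right} for each agent. After sufficient
learning steps, the agents learn their Q-value matrices at joint state
$\vec{s}$ as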




$$Q_1^{CT}(\vec{s}, \cdot) = Q_2^{CT}(\vec{s}, \cdot) = \begin{pmatrix}
0 & 0 & -10 & 0 \\
0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 \\
0 & -10 & 0 & 0
\end{pmatrix},$$

because when agent 1 and agent 2 select the action pair
$(\uparrow, \leftarrow)$ or $(\rightarrow, \downarrow)$, a collision
will occur, which leads to a punishment of -10 for each agent.
Otherwise, the reward is 0. Suppose that the agents are in the
"coordination state" shown in Fig. 4(b). When acting independently in
the environment, the agents learn their single-agent optimal Q-value
vectors at state $s_1$ or $s_2$ as
$Q_1(s_1, \cdot) = (-1, -10, -5, -1)$ and
$Q_2(s_2, \cdot) = (-10, -1, -1, -5)$. Then the local Q-value matrices
of this "coordination state" need to be initialized as

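$$Q_1^J(\vec{s}, \cdot) = Q_1(s_1, \cdot)^T \times (1,1,1,1) + Q_1^{CT}(\vec{s}, \cdot) = \begin{pmatrix}
-1 & -1 & -11 & -1 \\
-10 & -10 & -10 & -10 \\
-5 & -5 & -5 & -5 \\
-1 & -11 & -1 & -1
\end{pmatrix},$$

$$Q_2^J(\vec{s}, \cdot) = (1,1,1,1)^T \times Q_2(s_2, \cdot) + Q_2^{CT}(\vec{s}, \cdot) = \begin{pmatrix}
-10 & -1 & -11 & -5 \\
-10 & -1 & -1 & -5 \\
-10 & -1 & -1 & -5 \\
-10 & -11 & -1 & -5
\end{pmatrix}.$$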

So the pure strategy Nash equilibria for this joint state are
(up, down) and (right, left). If we initialized the local Q-values in
the way used in [34], or just initialized them to zeros, the learning
process would be much longer.

For a "coordination state" with three agents, the Q-value
$Q_i^{CT}(\vec{s}, \vec{a})$ can be calculated in the same way, except
that the relative positions are
$(\Delta x_1, \Delta y_1, \Delta x_2, \Delta y_2) = (x_1 - x_2, y_1 - y_2, x_2 - x_3, y_2 - y_3)$
and the Q-value for each joint state is cubic.
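
The two-agent worked example can be reproduced numerically. The sketch below is illustrative only: the array layout (rows indexed by agent 1's action, columns by agent 2's action, in the order up, down, left, right) follows the matrices in the text, while the helper name `pure_nash` and the variable names are assumptions made for this example.

```python
import numpy as np

ACTIONS = ("up", "down", "left", "right")

# Single-agent optimal Q-value vectors from the example.
q1 = np.array([-1.0, -10.0, -5.0, -1.0])
q2 = np.array([-10.0, -1.0, -1.0, -5.0])

# Transferred coordination Q-values: -10 for the two colliding pairs.
q_ct = np.zeros((4, 4))
q_ct[0, 2] = -10.0   # (up, left)
q_ct[3, 1] = -10.0   # (right, down)

# Hybrid initialisation: individual knowledge plus coordination knowledge.
q1_joint = q1[:, None] + q_ct   # agent 1's value is constant along each row
q2_joint = q2[None, :] + q_ct   # agent 2's value is constant along each column


def pure_nash(u1, u2):
    """Joint actions where each agent plays a best response to the other."""
    equilibria = []
    for r in range(u1.shape[0]):
        for c in range(u1.shape[1]):
            if u1[r, c] >= u1[:, c].max() and u2[r, c] >= u2[r, :].max():
                equilibria.append((ACTIONS[r], ACTIONS[c]))
    return equilibria


equilibria = pure_nash(q1_joint, q2_joint)
```

With this initialisation the equilibria are already the collision-free joint actions before any joint learning has taken place, which gives an intuition for why the hybrid initial values can shorten learning.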


Fig. 3: Two scenarios of possible collisions. In both collision
scenario I and collision scenario II, one agent moves right while the
other moves left.

Fig. 4: An example of the local Q-value transfer in a two-agent
system. (a) A 5 × 5 blank source task with two agents. (b) The
detected “coordination states”.
IV. EXPERIMENTS

To test the presented NegoSI algorithm, several groups of
simulated experiments are implemented and the results are
compared with those of three other state-of-the-art MARL
algorithms, namely, CQ-learning, NegoQ with value
function transfer (NegoQ-VFT) [29] and independent learners
with value function transfer (IL-VFT). In the next two subsections,
the presented NegoSI algorithm is applied to six grid world
games and an intelligent warehouse problem, which shows
that NegoSI is an effective approach for MARL problems
compared with other existing MARL algorithms.

For all these experiments, each agent has four actions: up,
down, left, right. The reward settings are as follows:

1) When an agent reaches its goal/goals, it receives a
reward of 100. Its final goal is an absorbing state. One
episode is over when all agents reach their goals.

2) A negative reward of -10 is received if a collision
happens or an agent steps out of the border. In these
cases, agents will bounce to their previous states.

3) Otherwise, the reward is set to -1 as the power consumption.

The settings of other parameters are the same for all algorithms:
learning rate $\alpha = 0.1$, discount rate $\gamma = 0.9$, exploration
factor $\epsilon = 0.01$ for the $\epsilon$-greedy strategy. All
algorithms run 2000 iterations for the grid world games and 8000
iterations for the intelligent warehouse problem. We use three typical
criteria to evaluate these MARL algorithms, i.e., steps of each episode
(SEE), rewards of each episode (REE) and average runtime
(AR). All the results are averaged over 50 runs.
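
To make these settings concrete, the sketch below encodes the three reward rules and a standard tabular Q-learning update with the stated learning rate, discount rate and exploration factor. It is only an illustration of the single-agent building block: the function names are hypothetical, the precedence of the -10 penalty over the goal reward is an assumption, and the equilibrium-based joint-action selection of NegoSI is not modelled here.

```python
import random

ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.01  # values stated in the text


def reward(reached_goal, collided, out_of_border):
    """Reward rules 1) to 3); penalty precedence is an assumption."""
    if collided or out_of_border:
        return -10
    if reached_goal:
        return 100
    return -1  # power consumption of an ordinary step


def epsilon_greedy(q_row, rng=random):
    """Pick a random action with probability EPSILON, else a greedy one."""
    if rng.random() < EPSILON:
        return rng.randrange(len(q_row))
    return max(range(len(q_row)), key=q_row.__getitem__)


def q_update(q, state, action, r, next_state, terminal=False):
    """One tabular Q-learning step; q maps a state to a list of action values."""
    target = r if terminal else r + GAMMA * max(q[next_state])
    q[state][action] += ALPHA * (target - q[state][action])
    return q[state][action]


# Two placeholder states with four actions each (up, down, left, right).
q = {"s0": [0.0] * 4, "s1": [0.0] * 4}
new_value = q_update(q, "s0", 0, reward(False, False, False), "s1")
print(new_value)
```

With an all-zero table, one ordinary step moves the visited entry a tenth of the way towards the -1 step reward.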

A. Tests on grid world games 

The proposed NegoSI algorithm is evaluated in the grid
world games presented by Melo and Veloso, which are
shown in Fig. 5. The first four benchmarks, i.e., ISR, SUNY,
MIT and PENTAGON (shown in Fig. 5(a)-(d), respectively),
are two-agent games, where the cross symbols denote the
initial locations of each agent and the goals of the other agent.








































In addition, we design two highly competitive games to further
test the algorithms, namely, GW_nju (Fig. 5(e)) and GWa3
(Fig. 5(f)). The game GW_nju has two agents and the game
GWa3 has three agents. The initial location and the goal of
agent $i$ are represented by the number $i$ and $G_i$, respectively.

We first examine the performances of the tested algorithms
regarding the SEE (steps of each episode) for each benchmark
map (see Fig. 6). The state-of-the-art value function transfer
mechanism helps NegoQ and ILs to converge fast in all games
except for SUNY. NegoSI also has good convergence characteristics.
CQ-learning, however, has a less stable learning
curve, especially in PENTAGON, GW_nju and GWa3. This
is reasonable since in these highly competitive games, the
agents’ prior-learned optimal policies always conflict
and CQ-learning cannot find the equilibrium joint action to
coordinate them. When more collisions occur, the agents are
frequently bounced back to previous states and forced to take more
steps before reaching their goals. The learning steps of the
final episode for each algorithm in each benchmark map are
shown in Table I. In ISR, NegoSI converges to 7.48, which
is the closest one to the value obtained by the optimal policy.
In other cases, the learning step of the final episode achieved by
NegoSI is between 105.1% and 120.4% of the best solution.
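
The comparison with the best solution can be reproduced from the entries of Table I. The sketch below assumes that "the best solution" means the lowest final-episode step count among the four tested algorithms on each map; this reading, as well as the helper name `ratio_to_best`, is an assumption. With the rounded table entries it gives roughly 105% to 120% for the maps other than ISR, in line with the range quoted above.

```python
# Final-episode learning steps copied from Table I.
TABLE_I = {
    "CQ-learning": {"ISR": 8.91, "SUNY": 10.70, "MIT": 19.81,
                    "PENTAGON": 15.32, "GW_nju": 12.65, "GWa3": 8.65},
    "ILVFT": {"ISR": 13.11, "SUNY": 10.38, "MIT": 18.67,
              "PENTAGON": 14.18, "GW_nju": 10.94, "GWa3": 11.40},
    "NegoQVFT": {"ISR": 8.36, "SUNY": 12.98, "MIT": 19.81,
                 "PENTAGON": 8.55, "GW_nju": 10.95, "GWa3": 8.31},
    "NegoSI": {"ISR": 7.48, "SUNY": 10.92, "MIT": 21.29,
               "PENTAGON": 10.30, "GW_nju": 12.11, "GWa3": 8.87},
}


def ratio_to_best(table, algorithm):
    """Percentage of an algorithm's steps relative to the best tested value."""
    ratios = {}
    for game, steps in table[algorithm].items():
        best = min(row[game] for row in table.values())
        ratios[game] = 100.0 * steps / best
    return ratios


ratios = ratio_to_best(TABLE_I, "NegoSI")
for game, value in ratios.items():
    print(f"{game}: {value:.1f}%")
```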


Then we analyze the REE (rewards of each episode) criterion
of these tested algorithms (see Figs. 7-12). The results
vary for each map. In ISR, NegoSI generally achieves the
highest REE throughout the whole learning process, which indicates
fewer collisions and more cooperation between the learning
agents. Also, the difference between the two agents’ reward
values is small, which reflects the fairness of the learning
process of NegoSI. The agents are more independent in
SUNY. Each of them has three candidates for the single-agent
optimal policy. In this setting, ILVFT has good performance.
NegoSI shows its fairness and high REE value compared
with NegoVFT and CQ-learning. The agents in MIT have
more choices for collision avoidance. All the tested algorithms
obtain a final reward of around 80. However, the learning
curves of CQ-learning are relatively unstable. In PENTAGON,
NegoSI proves its fairness and performs as well as
the algorithms with value function transfer.

In addition to the above benchmarks, we give two highly
competitive games to test the proposed algorithm. In both
games, the agents’ single-agent optimal policies conflict with


TABLE I: The learning steps of the final episode of each tested benchmark map.

                     ISR     SUNY    MIT     PENTAGON   GW_nju   GWa3
CQ-learning          8.91    10.70   19.81   15.32      12.65    8.65
ILVFT                13.11   10.38   18.67   14.18      10.94    11.40
NegoQVFT             8.36    12.98   19.81   8.55       10.95    8.31
NegoSI               7.48    10.92   21.29   10.30      12.11    8.87
The optimal policy   6       10      18      8          10       8



Fig. 5: Grid world games. (a) ISR, (b) SUNY, (c) MIT, (d) PENTAGON,
(e) GW_nju, (f) GWa3.

Fig. 6: SEE (steps of each episode) for each tested benchmark map.
(a) ISR, (b) SUNY, (c) MIT, (d) PENTAGON, (e) GW_nju, (f) GWa3.
each other and need appropriate coordination. In GW_nju, the
learning curves of NegoSI finally converge to 91.39 and 90.09,
which are very close to the optimal final REE values of 93
and 91, respectively. Similar to NegoSI, NegoVFT achieves
final REE values of 91.27 and 91.25. However, NegoSI can
negotiate for a fixed and safer policy that allows one agent to
always move greedily and the other to avoid collision (while
NegoVFT cannot), which shows the better coordination ability
of NegoSI. For the three-agent grid world GWa3, even though
NegoSI does not have the lowest SEE, it achieves the highest
REE throughout the whole learning process. For one thing, this
result demonstrates the scalability of NegoSI in the three-agent
setting. For another, it shows that the agents using NegoSI are
able to avoid collisions and obtain more rewards, while
the agents using traditional MARL methods have less desire
to coordinate and therefore lose rewards. Actually, even
though we developed our method with a non-cooperative
multi-agent model, it does not necessarily mean that the agents are
egoistic. Thanks to the negotiation mechanism, agents learn
to benefit themselves while doing little harm to others, which
shows evidence of cooperation.

The results regarding AR (average runtime) are shown in
Table II. ILVFT has the fastest learning speed, which is only
five to ten times slower than Q-learning that learns a single policy.
CQ-learning only considers joint states in the “coordination states”
and also has a relatively small computational complexity.
The AR of NegoVFT is five to ten times that
of ILVFT. This is reasonable since NegoVFT learns in the
whole joint state-action space and computes the equilibrium
for each joint state. The learning speed of NegoSI is slower
than that of CQ-learning but faster than that of NegoVFT. Even
though NegoSI adopts the sparse-interaction based learning framework
and has computational complexity similar to CQ-learning, it needs
to search for the equilibrium joint action in the “coordination
state”, which slows down the learning process.

B. A real-world application: intelligent warehouse systems 

MARL has been widely used in such simulated domains
as grid worlds, but few applications have been found for
realistic problems compared to single-agent RL algorithms
or MAS algorithms [40]-[42]. In this paper, we apply
the proposed NegoSI algorithm to an intelligent warehouse
problem. Previously, single-agent path planning methods have
been successfully used in complex systems. However,
the intelligent warehouse employs a team of mobile robots
to transport objects, and single-agent path planning methods
frequently cause collisions [45]. So we solve the multi-agent
coordination problem in a learning way.

Fig. 13: Simulation platform of the intelligent warehouse system.

The intelligent warehouse is made up of three parts: picking
stations, robots and storage shelves (shown in Fig. 13).
There are generally four steps in the order fulfillment process
for intelligent warehouse systems:

1) Input and decomposition: input and decompose orders
into separate tasks;

2) Task allocation: the central controller allocates the tasks
to corresponding robots using task allocation algorithms
(e.g., the Auction method);

3) Path planning: robots plan their transportation paths with 
a single-agent path planning algorithm; 

4) Transportation and collision avoidance: robots transport 
their target storage shelves to the picking station and 
then bring them back to their initial positions. During 
the transportation process, robots use sensors to detect 
shelves and other robots to avoid collisions. 

We focus on solving the multi-robot path planning and
collision avoidance in a MARL way and ignore the first two
steps of the order fulfillment process by requiring each robot
to finish a certain number of random tasks. The simulation
platform of an intelligent warehouse system is shown in
Fig. 13, which is a 16 × 21 grid environment where each grid
cell corresponds to 0.5 m × 0.5 m in the real world. Shaded grids
are storage shelves, which cannot be passed through. The state space
of each robot is made up of two parts: the location and the task
number. Each robot has 4 actions, namely, “up”, “down”, “left”
and “right”.
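
The robot state and the movement rule described above can be sketched as follows. This is an illustration only: the class `RobotState`, the function `step`, the 1-indexed (row, column) convention, the reading of 16 × 21 as 16 rows by 21 columns, and the sample shelf position are all assumptions, and task switching is not modelled.

```python
from dataclasses import dataclass

# Assumed convention: rows grow downwards, columns grow to the right.
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


@dataclass(frozen=True)
class RobotState:
    row: int
    col: int
    task: int  # index of the task the robot is currently working on


def step(state, action, shelves, rows=16, cols=21):
    """Apply one action; stay in place when the border or a shelf blocks it.

    Returns the next state and a flag telling whether the move was blocked.
    """
    d_row, d_col = MOVES[action]
    row, col = state.row + d_row, state.col + d_col
    inside = 1 <= row <= rows and 1 <= col <= cols
    if not inside or (row, col) in shelves:
        return state, True
    return RobotState(row, col, state.task), False


shelves = {(2, 2)}  # hypothetical shelf cell
start = RobotState(1, 1, task=0)
moved, bounced = step(start, "right", shelves)
stayed, hit_border = step(start, "up", shelves)
blocked_state, hit_shelf = step(moved, "down", shelves)
print(moved, bounced, stayed, hit_border, hit_shelf)
```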

We only compare the NegoSI algorithm with CQ-learning
for the intelligent warehouse problem. In fact, ILVFT has no
convergence guarantee and has difficulty converging in the
intelligent warehouse setting in practice.
NegoVFT is also infeasible since its internal memory cost in
MATLAB is estimated to be 208 × 208 × 208 × 10 × 10 × 10 ×
4 × 4 × 4 × 8 B = 4291 GB, while the costs of other algorithms
are about 2 MB. Like NegoVFT, other MG-based MARL
algorithms cannot solve the intelligent warehouse problem
either. Experiments were performed in 2-agent and 3-agent
settings, respectively, and the results are shown in Fig. 14
and Fig. 15.
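
The memory estimate can be checked with a few lines of arithmetic. The interpretation of the factors is an assumption that the text does not spell out: 208 is read as the number of location values per robot, 10 as the number of task-number values per robot, 4 as the number of actions, and 8 B as one double-precision entry, for three robots. The quoted figure agrees with the result when "GB" is taken as 2^30 bytes.

```python
LOCATIONS = 208      # assumed: location values per robot
TASKS = 10           # assumed: task-number values per robot
ACTIONS = 4          # up, down, left, right
BYTES_PER_VALUE = 8  # assumed: one double-precision Q-value
AGENTS = 3

# One Q-value per joint state-action pair of the three robots.
entries = (LOCATIONS * TASKS * ACTIONS) ** AGENTS
total_bytes = entries * BYTES_PER_VALUE
total_gib = total_bytes / 2**30

print(f"{entries} entries, {total_gib:.1f} GiB")
```

The joint table grows with the cube of the per-robot state-action count, which is why learning over the whole joint space becomes impractical here while methods that keep per-agent tables stay small.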

In the 2-agent setting, the initial position and final goal of
robot 1 are (1,1) and those of robot 2 are (1,16). Each robot
needs to finish 30 randomly assigned tasks. The task set is the
same for all algorithms. Robots with NegoSI achieve lower
SEE (steps of each episode) than robots with CQ-learning
throughout the whole learning process (as shown in Fig. 14).
NegoSI finally converges to 449.9 steps and CQ-learning
converges to 456.9 steps. In addition, robots with NegoSI have
higher and more stable REE (rewards of each episode). Finally,
the average runtime for completing all the tasks is 2227s for

























Fig. 7: REE (rewards of each episode) for each tested algorithm in ISR. (a) CQ-learning, (b) ILVFT, (c) NegoQVFT, (d) NegoSI.

Fig. 8: REE (rewards of each episode) for each tested algorithm in SUNY. (a) CQ-learning, (b) ILVFT, (c) NegoQVFT, (d) NegoSI.

Fig. 9: REE (rewards of each episode) for each tested algorithm in MIT. (a) CQ-learning, (b) ILVFT, (c) NegoQVFT, (d) NegoSI.

Fig. 10: REE (rewards of each episode) for each tested algorithm in PENTAGON. (a) CQ-learning, (b) ILVFT, (c) NegoQVFT, (d) NegoSI.

Fig. 11: REE (rewards of each episode) for each tested algorithm in GW_nju. (a) CQ-learning, (b) ILVFT, (c) NegoQVFT, (d) NegoSI.

Fig. 12: REE (rewards of each episode) for each tested algorithm in GWa3. (a) CQ-learning, (b) ILVFT, (c) NegoQVFT, (d) NegoSI.

In each panel of Figs. 7-12, the horizontal axis is episodes (×100) and the vertical axis is rewards of each episode, with one curve per agent.
TABLE II: Average runtime for each tested map.

                       ISR     SUNY    MIT     PENTAGON   GW_nju   GWa3
CQ-learning            8.54    5.91    13.68   8.52       5.07     5.95
ILVFT                  4.91    4.14    9.78    6.56       2.53     7.21
NegoQVFT               13.92   20.18   36.84   18.45      21.89    50.48
NegoSI                 16.74   7.33    19.58   16.18      7.08     16.41
QL for Single Policy   0.51    0.45    1.29    0.05       0.19     0.07

NegoSI and 3606s for CQ-learning. Robots with NegoSI used 38% less
time to complete all the tasks than robots with CQ-learning.

NegoSI also performs better than CQ-learning in the 3-agent
setting. The initial positions and final goals of the three
robots are (1,1), (1,8) and (1,16). Each
robot needs to finish 10 randomly assigned tasks. The task
set is the same for all algorithms. SEE (steps of each
episode) for robots with NegoSI finally converges to 168.7
steps and that for robots with CQ-learning converges to 177.3
steps (as shown in Fig. 15). In addition, the learning curves
of NegoSI are more stable. Robots with NegoSI have higher
and more stable REE (rewards of each episode). Finally, the
average runtime for completing all the tasks is 1352s for
NegoSI and 2814s for CQ-learning. The robots use 52% less
time to complete all the tasks with NegoSI than with
CQ-learning.
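
The reported time savings follow directly from the average runtimes. The short sketch below recomputes them; the helper name `time_saved_percent` is illustrative.

```python
def time_saved_percent(t_negosi, t_cq):
    """Relative runtime reduction of NegoSI with respect to CQ-learning."""
    return 100.0 * (t_cq - t_negosi) / t_cq


# Average runtimes in seconds, as reported for the two warehouse settings.
two_agent = time_saved_percent(2227, 3606)
three_agent = time_saved_percent(1352, 2814)
print(f"2-agent: {two_agent:.1f}%, 3-agent: {three_agent:.1f}%")
```

Rounded to whole percentages, these values agree with the 38% and 52% savings stated for the 2-agent and 3-agent settings.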

Remark 4: The agents with the CQ-learning algorithm learn
faster than the agents with NegoSI for the benchmark maps
in Section IV-A but slower than those with NegoSI for the
intelligent warehouse problem in Section IV-B. The reason is
that for agents with CQ-learning, the number of “coordination
states” is several times higher than that with NegoSI. This
difference becomes significant as the number of tasks,
the environment scale and the number of agents increase. Thus,
the time used to search for one specific “coordination state” in
the “coordination state” pool increases faster in CQ-learning
than in NegoSI, which increases the whole
learning time. According to all these experimental results,
the presented NegoSI algorithm maintains better performance
regarding such characteristics as coordination ability,
convergence, scalability and computational complexity, especially for
practical problems.

V. Conclusions 

In this paper a negotiation-based MARL algorithm with 
sparse interactions (NegoSI) is proposed for the learning 
and coordination problems in multi-agent systems. In this 




Fig. 14: SEE (steps of each episode) and REE (rewards of each episode) for NegoSI and CQ-learning in the 2-agent intelligent warehouse. (a) SEE, (b) REE.

Fig. 15: SEE (steps of each episode) and REE (rewards of each episode) for NegoSI and CQ-learning in the 3-agent intelligent warehouse. (a) SEE, (b) REE. Panel (b) plots the rewards of each agent of NegoSI and CQ-learning against episodes (×100).



integrated algorithm the knowledge transfer mechanism is
also adopted to improve the agents’ learning speed and
coordination ability. In contrast to traditional sparse-interaction based
MARL algorithms, NegoSI adopts the equilibrium concept
and makes it possible for agents to select the non-strict EDSP
or Meta equilibrium for their joint actions, which makes it
easy to find a near-optimal (or even optimal) policy and to
avoid collisions as well. The experimental results demonstrate
the effectiveness of the presented NegoSI algorithm regarding
such characteristics as fast convergence, low computational
complexity and high scalability in comparison to the
state-of-the-art MARL algorithms, especially for practical problems.
Our future work focuses on further comparison of NegoSI
with other existing MARL algorithms and more applications
of MARL algorithms to general and realistic problems. In
addition, multi-objective reinforcement learning (MORL) [46]-[48]
will also be considered to further combine environment
information and coordination knowledge for local learning.